In class ex 5

Objective

In this exercise, we will be building a logistic regression model for the water point status at Osun state, Nigeria.

Importing Packages

pacman::p_load(sf, tidyverse, funModeling, blorr, corrplot, ggpubr, spdep,GWmodel, tmap,skimr,caret,report)

Importing the Analytical Data

Osun <- read_rds('rds/Osun.rds')
Osun_wp_sf <- read_rds('rds/Osun_wp_sf.rds')

osun contains the polygon boundaries while Osun_wp_sf contains the water point data in Osun Nigeria.

The rds files have been preprocessed - eg. cleaning up of variables and variable names.

Next, we check the status field of the Osun_wp_sf sf dataframe object. This field is derived from the original status_clean field where

  • observations that are null are filtered away

  • remaining values that indicate water points are functional are labelled T

  • else they are labelled F

Osun_wp_sf %>%
  freq(input='status')
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
  Please report the issue at <]8;;https://github.com/pablo14/funModeling/issueshttps://github.com/pablo14/funModeling/issues]8;;>.

  status frequency percentage cumulative_perc
1   TRUE      2642       55.5            55.5
2  FALSE      2118       44.5           100.0
tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(Osun)+
  tmap_options(check.and.fix = TRUE) +
  tm_polygons(alpha = 0.4) +
  tm_shape(Osun_wp_sf) +
  tm_dots(col = "status",
          alpha = 0.6) +
  tm_view(set.zoom.limits = c(9,12))
tmap_mode("plot")
tmap mode set to plotting

Exploratory Data Analysis (EDA)

Osun_wp_sf %>%
  skim()
Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
Data summary
Name Piped data
Number of rows 4760
Number of columns 75
_______________________
Column type frequency:
character 47
logical 5
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.00 5 44 0 2 0
report_date 0 1.00 22 22 0 42 0
status_id 0 1.00 2 7 0 3 0
water_source_clean 0 1.00 8 22 0 3 0
water_source_category 0 1.00 4 6 0 2 0
water_tech_clean 24 0.99 9 23 0 3 0
water_tech_category 24 0.99 9 15 0 2 0
facility_type 0 1.00 8 8 0 1 0
clean_country_name 0 1.00 7 7 0 1 0
clean_adm1 0 1.00 3 5 0 5 0
clean_adm2 0 1.00 3 14 0 35 0
clean_adm3 4760 0.00 NA NA 0 0 0
clean_adm4 4760 0.00 NA NA 0 0 0
installer 4760 0.00 NA NA 0 0 0
management_clean 1573 0.67 5 37 0 7 0
status_clean 0 1.00 9 32 0 7 0
pay 0 1.00 2 39 0 7 0
fecal_coliform_presence 4760 0.00 NA NA 0 0 0
subjective_quality 0 1.00 18 20 0 4 0
activity_id 4757 0.00 36 36 0 3 0
scheme_id 4760 0.00 NA NA 0 0 0
wpdx_id 0 1.00 12 12 0 4760 0
notes 0 1.00 2 96 0 3502 0
orig_lnk 4757 0.00 84 84 0 1 0
photo_lnk 41 0.99 84 84 0 4719 0
country_id 0 1.00 2 2 0 1 0
data_lnk 0 1.00 79 96 0 2 0
water_point_history 0 1.00 142 834 0 4750 0
clean_country_id 0 1.00 3 3 0 1 0
country_name 0 1.00 7 7 0 1 0
water_source 0 1.00 8 30 0 4 0
water_tech 0 1.00 5 37 0 20 0
adm2 0 1.00 3 14 0 33 0
adm3 4760 0.00 NA NA 0 0 0
management 1573 0.67 5 47 0 7 0
adm1 0 1.00 4 5 0 4 0
New Georeferenced Column 0 1.00 16 35 0 4760 0
lat_lon_deg 0 1.00 13 32 0 4760 0
public_data_source 0 1.00 84 102 0 2 0
converted 0 1.00 53 53 0 1 0
created_timestamp 0 1.00 22 22 0 2 0
updated_timestamp 0 1.00 22 22 0 2 0
Geometry 0 1.00 33 37 0 4760 0
ADM2_EN 0 1.00 3 14 0 30 0
ADM2_PCODE 0 1.00 8 8 0 30 0
ADM1_EN 0 1.00 4 4 0 1 0
ADM1_PCODE 0 1.00 5 5 0 1 0

Variable type: logical

skim_variable n_missing complete_rate mean count
rehab_year 4760 0 NaN :
rehabilitator 4760 0 NaN :
is_urban 0 1 0.39 FAL: 2884, TRU: 1876
latest_record 0 1 1.00 TRU: 4760
status 0 1 0.56 TRU: 2642, FAL: 2118

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 68550.48 10216.94 49601.00 66874.75 68244.50 69562.25 471319.00 ▇▁▁▁▁
lat_deg 0 1.00 7.68 0.22 7.06 7.51 7.71 7.88 8.06 ▁▂▇▇▇
lon_deg 0 1.00 4.54 0.21 4.08 4.36 4.56 4.71 5.06 ▃▆▇▇▂
install_year 1144 0.76 2008.63 6.04 1917.00 2006.00 2010.00 2013.00 2015.00 ▁▁▁▁▇
fecal_coliform_value 4760 0.00 NaN NA NA NA NA NA NA
distance_to_primary_road 0 1.00 5021.53 5648.34 0.01 719.36 2972.78 7314.73 26909.86 ▇▂▁▁▁
distance_to_secondary_road 0 1.00 3750.47 3938.63 0.15 460.90 2554.25 5791.94 19559.48 ▇▃▁▁▁
distance_to_tertiary_road 0 1.00 1259.28 1680.04 0.02 121.25 521.77 1834.42 10966.27 ▇▂▁▁▁
distance_to_city 0 1.00 16663.99 10960.82 53.05 7930.75 15030.41 24255.75 47934.34 ▇▇▆▃▁
distance_to_town 0 1.00 16726.59 12452.65 30.00 6876.92 12204.53 27739.46 44020.64 ▇▅▃▃▂
rehab_priority 2654 0.44 489.33 1658.81 0.00 7.00 91.50 376.25 29697.00 ▇▁▁▁▁
water_point_population 4 1.00 513.58 1458.92 0.00 14.00 119.00 433.25 29697.00 ▇▁▁▁▁
local_population_1km 4 1.00 2727.16 4189.46 0.00 176.00 1032.00 3717.00 36118.00 ▇▁▁▁▁
crucialness_score 798 0.83 0.26 0.28 0.00 0.07 0.15 0.35 1.00 ▇▃▁▁▁
pressure_score 798 0.83 1.46 4.16 0.00 0.12 0.41 1.24 93.69 ▇▁▁▁▁
usage_capacity 0 1.00 560.74 338.46 300.00 300.00 300.00 1000.00 1000.00 ▇▁▁▁▅
days_since_report 0 1.00 2692.69 41.92 1483.00 2688.00 2693.00 2700.00 4645.00 ▁▇▁▁▁
staleness_score 0 1.00 42.80 0.58 23.13 42.70 42.79 42.86 62.66 ▁▁▇▁▁
location_id 0 1.00 235865.49 6657.60 23741.00 230638.75 236199.50 240061.25 267454.00 ▁▁▁▁▇
cluster_size 0 1.00 1.05 0.25 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
lat_deg_original 4760 0.00 NaN NA NA NA NA NA NA
lon_deg_original 4760 0.00 NaN NA NA NA NA NA NA
count 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁

We will drop the variable install_year that has missing values and may not be useful in the logistic regression model later. We will be using the below variables.

Osun_wp_sf_clean <- Osun_wp_sf %>%
  filter_at(vars(status,
                 distance_to_primary_road,
                 distance_to_secondary_road,
                 distance_to_tertiary_road,
                 distance_to_city,
                 distance_to_town,
                 water_point_population,
                 local_population_1km,
                 usage_capacity,
                 is_urban,
                 water_source_clean),
            all_vars(!is.na(.))) %>%
  mutate(usage_capacity = as.factor(usage_capacity))

In the above code, we have also excluded missing values and recoded usage capacity to a categorical label.

Correlation Analysis

We select the required variables for plotting the correlation matrix and remove the geometry column.

Osun_wp <- Osun_wp_sf_clean %>%
  select(c(7,35:39,42:43,46:47,57)) %>%
  st_set_geometry(NULL)
cluster_vars.cor = cor(Osun_wp[,2:7])
corrplot.mixed(cluster_vars.cor,
               lower = 'ellipse',
               upper = 'number',
               diag = 'l',
               tl.col = 'black'
               )

None of the variables are highly correlated to any other variable, so we will be keeping all variables for the logistic regression model.

Building a logistic regression model

model <- glm(status ~ distance_to_primary_road +
               distance_to_secondary_road +
               distance_to_tertiary_road +
               distance_to_town +
               is_urban +
               usage_capacity + 
               water_source_clean + 
               water_point_population + 
               local_population_1km, 
             data = Osun_wp_sf_clean,
             family = binomial(link = 'logit'))

We use blr_regress() of blorr to generate the model report in scientific literature reporting format.

# report(model)
blr_regress(model)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4745           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.0981        0.0939      1.0441      0.2964 
        distance_to_primary_road            1      0.0000        0.0000     -2.2037      0.0275 
       distance_to_secondary_road           1      0.0000        0.0000     -0.6184      0.5363 
       distance_to_tertiary_road            1      1e-04         0.0000      3.5609       4e-04 
            distance_to_town                1      0.0000        0.0000     -3.5553       4e-04 
              is_urbanTRUE                  1     -0.2145        0.0798     -2.6885      0.0072 
           usage_capacity1000               1     -0.6544        0.0693     -9.4488      0.0000 
water_source_cleanProtected Shallow Well    1      0.4588        0.0852      5.3869      0.0000 
   water_source_cleanProtected Spring       1      1.3151        0.4374      3.0068      0.0026 
         water_point_population             1      -5e-04        0.0000    -11.5660      0.0000 
          local_population_1km              1      3e-04         0.0000     19.3293      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7299          Somers' D        0.4598   
% Discordant          0.2701          Gamma            0.4598   
% Tied                0.0000          Tau-a            0.2271   
Pairs                5585188          c                0.7299   
---------------------------------------------------------------

Variables with pvalue less than 0.05 are statistically significant at 95% confidence level. This leaves distance_to_secondary_road as an insignificant variable

For interpretation of logistic regression report:

  • Categorical variables: A positive value implies an above average correlation and a negative value implies a below average correlation, while the magnitude of the coefficient does not matter for categorical variables;

  • Continuous variables: a positive value implies a direct correlation and a negative value implies an inverse correlation, while the magnitude of the value gives the strength of the correlation.

blr_confusion_matrix(model,cutoff = 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1302  762
         1   812 1880

                Accuracy : 0.6690 
     No Information Rate : 0.4445 

                   Kappa : 0.3283 

McNemars's Test P-Value  : 0.2168 

             Sensitivity : 0.7116 
             Specificity : 0.6159 
          Pos Pred Value : 0.6984 
          Neg Pred Value : 0.6308 
              Prevalence : 0.5555 
          Detection Rate : 0.3953 
    Detection Prevalence : 0.5660 
       Balanced Accuracy : 0.6637 
               Precision : 0.6984 
                  Recall : 0.7116 

        'Positive' Class : 1

The validity of a cut-off is measured using sensitivity, specificity and accuracy.

  • Sensitivity: The % of correctly classified events out of all events = TP / (TP + FN)

  • Specificity: The % of correctly classified

  • Accuracy: The % of correctly classified events out of all events = (TP + TN) / (TP + FP + TN + FN)

From the output, we see that the model gives us an accuracy of 0.6739, which is a good start as it is better than guessing (0.5).

The sensitivity and specificity are 0.7207 and 0.6154 respectively. This shows that the true positives are slightly higher than the true negative prediction rates.

Building geographically weighted logistic regression models

Osun_wp_sp <- Osun_wp_sf_clean %>%
  select(c(status,
           distance_to_primary_road,
           distance_to_tertiary_road,
           distance_to_city,
           distance_to_town,
           water_point_population,
           local_population_1km,
           is_urban,
           usage_capacity,
           water_source_clean
           )) %>%
  as_Spatial()
Osun_wp_sp
class       : SpatialPointsDataFrame 
features    : 4756 
extent      : 182502.4, 290751, 340054.1, 450905.3  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs 
variables   : 10
names       : status, distance_to_primary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, is_urban, usage_capacity, water_source_clean 
min values  :      0,        0.014461356813335,         0.017815121653488, 53.0461399623541, 30.0019777713073,                      0,                    0,        0,           1000,           Borehole 
max values  :      1,         26909.8616132094,          10966.2705628969,  47934.343603562, 44020.6393368124,                  29697,                36118,        1,            300,   Protected Spring 

Computing fixed bandwidth

bw.fixed <- bw.ggwr(status ~ distance_to_primary_road +
               distance_to_tertiary_road +
               distance_to_town +
               is_urban +
               usage_capacity + 
               water_source_clean + 
               water_point_population + 
               local_population_1km, 
               data = Osun_wp_sp,
               family = 'binomial',
               approach = 'AIC',
               kernel = 'gaussian',
               adaptive = FALSE,
               longlat = FALSE
                      )
bw.fixed

Using the bandwidth, we will model the geoweighted logistic regression model

gwlr.fixed <- ggwr.basic(status ~
                           distance_to_primary_road +
                           distance_to_tertiary_road +
                           distance_to_city +
                           distance_to_town +
                           water_point_population +
                           local_population_1km +
                           usage_capacity +
                           is_urban +
                           water_source_clean,
                         data = Osun_wp_sp,
                         bw = 2471.029,
                         family = "binomial",
                         kernel = "gaussian",
                         adaptive = FALSE,
                         longlat = FALSE)
 Iteration    Log-Likelihood
=========================
       0        -1954 
       1        -1675 
       2        -1527 
       3        -1443 
       4        -1407 
       5        -1407 

Converting SDF into sf dataframe

To assess the performance of the gwLR, we will convert the SDF object in as data frame by using code below.

gwr.fixed <- as.data.frame(gwlr.fixed$SDF)

Next, we will label yhat values greater or equal to 0.5 into 1 and else 0. The results of the logic comparison operation will be saved in to the field called most.

gwr.fixed <- gwr.fixed %>%
  mutate(most = ifelse(
    gwr.fixed$yhat >=0.5, T,F
  ))

gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data = gwr.fixed$most, reference = gwr.fixed$y)
CM
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1833  268
     TRUE    281 2374
                                          
               Accuracy : 0.8846          
                 95% CI : (0.8751, 0.8935)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7661          
                                          
 Mcnemar's Test P-Value : 0.6085          
                                          
            Sensitivity : 0.8671          
            Specificity : 0.8986          
         Pos Pred Value : 0.8724          
         Neg Pred Value : 0.8942          
             Prevalence : 0.4445          
         Detection Rate : 0.3854          
   Detection Prevalence : 0.4418          
      Balanced Accuracy : 0.8828          
                                          
       'Positive' Class : FALSE           
                                          
Osun_wp_sf_selected <- Osun_wp_sf_clean %>%
  select(c(ADM2_EN, ADM2_PCODE, ADM1_EN, ADM1_PCODE, status))
gwr_sf.fixed <- cbind(Osun_wp_sf_selected, gwr.fixed)

Visualising coefficient estimates

tmap_mode("view")
tmap mode set to interactive viewing
prob_T <- tm_shape(Osun) + 
  tm_polygons(alpha = 0.1) + 
  tm_shape(gwr_sf.fixed) + 
  tm_dots(col = 'yhat',
          border.col = 'gray60',
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(8,14))
prob_T